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Understanding patterns of spontaneous mutations is of fundamental interest in studies of human genome evolution and 
genetic disease. Here, we used extremely rare variants in humans to model the molecular spectrum of single-nucleotide 
mutations. Compared to common variants in humans and human-chimpanzee fixed differences (substitutions), rare 
variants, on average, arose more recently in the human lineage and are less affected by the potentially confounding effects 
of natural selection, population demographic history, and biased gene conversion. We analyzed variants obtained from 
a population-based sequencing study of 202 genes in >14,000 individuals. We observed considerable variability in the per- 
gene mutation rate, which was correlated with local GC content, but not recombination rate. Using >20,000 variants with 
a derived allele frequency slO 4 , we examined the effect of local GC content and recombination rate on individual variant 
subtypes and performed comparisons with common variants and substitutions. The influence of local GC content on rare 
variants differed from that on common variants or substitutions, and the differences varied by variant subtype. Fur- 
thermore, recombination rate and recombination hotspots have little effect on rare variants of any subtype, yet both have 
a relatively strong impact on multiple variant subtypes in common variants and substitutions. This observation is con- 
sistent with the effect of biased gene conversion or selection-dependent processes. Our results highlight the distinct biases 
inherent in the initial mutation patterns and subsequent evolutionary processes that affect segregating variants. 

[Supplemental material is available for this article.] 



Mutation is one of the most fundamental processes in biology. It is 
the ultimate source of genetic variation and one of the driving 
forces of evolution. Mutation also plays a significant role in the 
etiology of human diseases. There is considerable interest in un- 
derstanding the underlying pattern and molecular spectrum of 
spontaneous mutations. Historically, two approaches were applied 
to estimate the single-nucleotide mutation rate in humans. The 
first analyzes divergent sites between humans and another species, 
typically chimpanzee. According to Kimura's neutral theory, the 
majority of substitutions are neutral and therefore the extent of 
between-species divergence can be used to estimate the neutral 
mutation rate (Kimura 1983). Many groups have applied this ap- 
proach to estimate the spontaneous mutation rate in humans 
(Drake et al. 1998; Nachman and Crowell 2000; Kumar and Sub- 
ramanian 2002; Silva and Kondrashov 2002). However, several 
forces, including natural selection, biased gene conversion (BGC), 
and demographic history, can alter fixation probabilities and re- 
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shape the spectrum and genomic distribution of between-species 
substitution patterns. A second, more direct approach, pioneered 
by Haldane (1935), uses incidence rates of dominant disorders in 
humans to estimate the mutation rate (Sommer 1995; Sommer and 
Ketterling 1996; Kondrashov 2003; Lynch 2010). This approach, 
however, is limited by the fact that only a small subset of new 
mutations manifest as disease variants (Nachman 2004). 

The mutation rates from these studies represent a genome- 
wide average. However, there is extensive variability among differ- 
ent genes or genomic regions in both between-species divergence 
and within-species diversity (Wolfe et al. 1989; Nachman and 
Crowell 2000; Sachidanandam et al. 2001; Smith and Lercher 
2002; Kondrashov 2003; Hodgkinson et al. 2009). This suggests 
that spontaneous mutation rates are not constant throughout the 
genome, although the reasons behind this variability are unclear. 

Local nucleotide composition is a frequently studied feature 
that could contribute to mutation rate variability. One study 
showed that AT > GC (an A base replaced with a G or a T base 
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replaced with a C) common variants segregate at a higher frequency 
in regions with higher GC content (Webster et al. 2003), and others 
similarly reported increased fixation bias toward GC base pairs in 
GC-rich regions (Lercher and Hurst 2002a; Lercher et al. 2002). 
However, analyses of GC content and variant patterns often 
reported contradicting findings. For example, while some studies 
showed that GC content is positively correlated with both di- 
vergence rates between humans and chimpanzee (Smith et al. 2002; 
Webster et al. 2003; Arndt and Hwa 2005; Duret and Arndt 2008) 
and within-human nucleotide diversity (Sachidanandam et al. 
2001; Hellmann et al. 2005), another study found a negative cor- 
relation (Cai et al. 2009). Furthermore, while some studies reported 
increasing GC > AT substitution rates with increasing GC content 
(Smith et al. 2002; Webster et al. 2003), others showed a decrease 
(Arndt and Hwa 2004; Duret and Arndt 2008). These inconsistencies 
could be partly explained by differences in the allele frequency, and 
therefore the evolutionary time scale of the variants analyzed in 
different studies. Consequently the observed patterns could be the 
result of confounding factors, such as selection and demography, 
instead of alterations in the actual mutation rate. 

Recombination is known to influence patterns of common 
variation and substitution rates. Correlations between recom- 
bination rate and nucleotide diversity or between species sub- 
stitution rates have been observed in humans (Nachman et al. 1998; 
Nachman 2001; Lercher and Hurst 2002b; Hellmann et al. 
2003, 2005; Spencer et al. 2006; Duret and Arndt 2008; Cai et al. 
2009; Lohmueller et al. 2011), Drosophila (Begun and Aquadro 
1992; Begun et al. 2007; Kulathinal et al. 2008), and several plant 
species (Dvorak et al. 1998; Kraft et al. 1998; Stephan and Langley 
1998; Tenaillon et al. 2004). Three major theories exist to explain 
these observations. First, recombination may be directly muta- 
genic, leading to increased mutation rates in regions of high re- 
combination and thus higher diversity (Lercher and Hurst 2002b; 
Hellmann et al. 2003, 2008). Second, while background selection 
and selective sweeps reduce haplotype diversity, recombination 
generates new haplotypes by shuffling variants onto different 
backgrounds, thereby maintaining diversity in regions of high 
recombination rates (Kaplan et al. 1989; Charlesworth et al. 1993, 
1995; Hudson and Kaplan 1995; Nachman 2001). A third expla- 
nation is BGC, a recombination-associated process that preferen- 
tially repairs AT/GC mismatches produced during recombination 
to GC bases, leading to preferential fixation of GC alleles (for re- 
view, see Duret and Galtier 2009). Over time, the observed effect of 
BGC can mimic that of natural selection, leading to an excess of 
"weak" (W) A/T bases converted to "strong" (S) G/C bases as if the 
latter were under positive selection (Berglund et al. 2009; Galtier 
et al. 2009; Necsulea et al. 2011). The reports hypothesizing a 
mutagenic effect of recombination relied on common variants and 
substitutions (Lercher and Hurst 2002b; Hellmann et al. 2003, 
2008). Several lines of evidence argue against the mutagenic re- 
combination theory and instead suggest that a selection-dependent 
mechanism or BGC can explain the observed correlation between 
diversity and recombination rate (Duret and Arndt 2008; Berglund 
et al. 2009; Galtier et al. 2009; Lohmueller et al. 2011). 

Previous studies using common variants within humans and 
substitutions between humans and chimpanzees are effectively 
dealing with mutations accumulated over many generations. Their 
patterns, therefore, reflect the cumulative influence of many pro- 
cesses, including natural selection, population demographic history, 
and BGC. A major challenge in the field is to elucidate the extent to 
which these forces alter the distribution of variants over time and to 
distinguish their relative contributions. To minimize the effects of 



selection, many studies restrict their analysis to noncoding regions of 
the genome. However, widespread signatures of recent positive selec- 
tion, even within supposedly neutral regions (Williamson et al. 2007), 
suggest that noncoding regions may also be influenced by selection. 

Rare variants represent a newly available and expanding re- 
source that can overcome some of these limitations. Rare variants are 
relatively young, predominantly because they are the result of recent 
mutation events. Therefore, rare variants are typically less affected 
by population demographic history or natural selection (Messer 
2009). Furthermore, as BGC acts only on variants after they have 
arisen in the population (Duret and Galtier 2009), it does not 
influence innate mutation rates. Rare variants, therefore, are an 
appropriate resource for studying the spectrum and genomic distri- 
bution of mutations while minimizing the potentially confounding 
influences. In addition, while family-based whole-genome se- 
quencing has begun to identify de novo mutations that provide 
more direct measures of mutation rates (The 1000 Genomes Project 
Consortium 2010; Conrad et al. 2011; Campbell et al. 2012; Kong 
et al. 2012), the identified mutations sparsely cover the genome. For 
example, if whole-genome sequencing of each parent-offspring trio 
yields —40 de novo mutations (Conrad et al. 2011), 500 such trios 
would need to be sequenced to accumulate roughly 20,000 muta- 
tions. These mutations, however, would occur once per 150 kb on 
average, and the data would lack the spatial resolution necessary to 
detect the effect of local genomic context on a finer scale. 

We studied a set of rare variants discovered via targeted rese- 
quencing of 202 genes in >14,000 unrelated individuals. We ana- 
lyzed the per-gene mutation rate as well as the probability of each 
site to contain a variant of a specific subtype relative to local GC 
content, recombination rate, and recombination hotspots. In or- 
der to compare mutation rate inferences based on rare variants 
with those obtained by within- and between-species data, we 
compared rare variant patterns to common variant data from The 
1000 Genomes Project Consortium and substitution sites between 
humans and chimpanzee. These three variant classes cover dif- 
ferent evolutionary time scales, and the differences between them 
allow us to examine the distinct influence of genomic context on 
the initial mutation process, the subsequent rise of some muta- 
tions to become common variants, and eventual fixation. 

Results 

Variant counts and densities among rare variants, common 
variants, and substitutions 

We obtained rare variants from a previously described sequencing 
study targeting the exons and flanking intronic regions of 202 
genes in >14,000 individuals to a median depth of 27x (Nelson 
et al. 2012). The genes are drug targets relevant in 12 complex 
diseases; and the subjects were recruited for genetic association 
studies of these diseases (Nelson et al. 2012). Several complemen- 
tary methods were used to assess the quality of rare variants in 
these data. Among singleton variants, the false positive and neg- 
ative rates were estimated to be 2.0% and 2.7%, respectively, with 
lower error rates estimated for more common variants (Nelson 
et al. 2012). For this study we focused on the 195 autosomal genes, 
with —700 kb targeted regions in —2000 targeted exons, which 
contained a total of 20,053 rare variants with a derived allele fre- 
quency (DAF) £10~ 4 in the European subset (N = 12,515). Each 
variant was categorized into one of seven possible variant subtypes 
based on the ancestral and derived allele states: AT > GC, GC > AT, 
CpG GC > AT, AT > CG, GC > TA, AT > TA, and GC > CG (Table 1). The 
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Table 1. Variant counts and conditional variant proportions across variant subtype for rare variants, common variants, and substitutions 



Transitions Transversions 



Variant type 


AT > GC 


CpG GC > AT 


GC > AT 


AT > CG 


GOTA 


AT > TA 


GOCG 


Total 


Ti/Tv 


W > S/S > W 


Rare variants 


4778 


3951 


5338 


1215 


1796 


1023 


1952 


20,053 


2.35 


0.54 




(1 .28%) 


(12.8%) 


(1.71%) 


(0.32%) 


(0.52%) 


(0.27%) 


(0.57%) 


(2.79%) 






Common variants 


6060 


3684 


5845 


1519 


2078 


1261 


2119 


22,566 


2.23 


0.65 




(0.10%) 


(1 .08%) 


(0.11%) 


(0.025%) 


(0.038%) 


(0.021%) 


(0.038%) 


(0.19%) 






Substitutions 


6154 


2805 


5815 


1679 


2092 


1183 


2184 


21,912 


2.07 


0.73 




(0.27%) 


(2.18%) 


(0.30%) 


(0.075%) 


(0.10%) 


(0.053%) 


(0.11%) 


(0.51%) 







Counts of all variant subtypes across rare variants, common variants, and substitutions are shown. Conditional variant proportion for each variant 
subtype, defined as the number of observed variants divided by the number of bases that could give rise to the given variant, is shown below in 
parentheses. W > S/S > W was defined as the total number of weak to strong (W > S) variants divided by the total number of strong to weak (S > W) 
variants, including CpG GC > AT variants. Ti/Tv is the ratio of transitions to transversions. CpG-induced GC > TA and GC > CG variants are included in the 
GC > TA and GC > CG variant subtypes, respectively. 



notation of AT > GC indicates a site where the ancestral base A has 
a G as the derived allele, or ancestral base T has a derived allele C. 

We summarized variant counts by subtype (Table 1). Nearly 
13% of CpG sites have a rare GC > AT variant, compared with only 
1.71% of non-CpG GC bases, consistent with the known hyper- 
mutability of CpG dinucleotides (Nachman and Crowell 2000; 
Kondrashov 2003; Hwang and Green 2004). Among rare variants, 
there were nearly twice as many S > W variants (those converting a 
G/C base pair into an A/T base pair) as the opposite W > S variants 
(Table 1). This mutational AT bias is consistent with previous obser- 
vations (Lynch 2010), and can be mainly explained by the relatively 
high frequency of GC > AT variants at CpG dinucleotides (Table 1). 

For comparison, we also analyzed common variants and 
substitutions. We randomly sampled intergenic regions from the 
human genome to obtain common variants and substitutions for 
analysis while matching the genomic context of the rare variant 
data set (see Methods). Sampling intergenic regions allowed us to 
minimize effects of selection. To achieve comparable statistical 
power, we sampled a similar number of common variants and 
substitutions as the rare variants. In all, we obtained 22,566 var- 
iants from the European subset of The 1000 Genomes Project 
Consortium with a DAF > 5% and 21,912 human-lineage-specific 
divergent sites between humans and chimpanzee (Table 1). 

The relative proportion of variant subtypes differed among 
the three variant classes. Figure 1 shows the total variant proportion, 
defined as the number of variants of a given subtype over the total 
number of variants in that variant class. The relative proportion of 
AT > GC variants increased progressively from rare variants to sub- 
stitutions, while CpG GC > AT transitions correspondingly de- 
creased (Fig. 1). Other variant subtypes showed little change across 
the three variant classes. These observed proportions, however, are 
influenced by the different sets of sites analyzed for rare variants and 
for common variants/substitutions, and cannot be directly inter- 
preted as the relative mutation rates across subtypes. For example, 
mutation rates at CpG sites are affected by methylation status, yet 
sites near promoters tend to be hypomethylated compared with 
intergenic regions (Molaro et al. 2011). 

The conditional variant proportion, defined as the number 
of a given variant subtype divided by the total number of bases 
that could produce the given subtype, was higher in all rare var- 
iant subtypes compared with common variants and substitutions 
(Table 1). The higher "absolute" conditional variant proportion 
in rare variants is expected, as the rare variants were discovered in 
>12,000 individuals. Importantly, the "relative" magnitudes across 
rare variant subtypes are expected to more closely reflect the relative 



spontaneous mutation rate than common variants or substitu- 
tions. The results for rare variants in Table 1, therefore, provide 
more accurate estimates of the relative mutation rates among 
different mutation subtypes. 

The per-gene mutation rate was influenced by GC content 
but not recombination rate 

We analyzed the per-gene mutation rate for 193 genes (out of the 
195 autosomal genes), calculated previously by Nelson et al. 
(2012), using the method described by Coventry et al. (2010) and 
Wakeley and Takahashi (2003). There were considerable fluctua- 
tions in the mutation rate across genes (Fig. 2A). To assess the 
impact of genomic context on this variability, we calculated aver- 
age GC content and recombination rate within the transcribed 
region of each gene (Fig. 2B and C, respectively). There was a weak 
but significant positive correlation between mutation rate and GC 
content (Pearson's r = 0.22, P = 0.0031) (Fig. 2D, dashed line). Re- 
combination rate, however, was not significantly correlated with 
mutation rate (Pearson's r = 0.039, P = 0.60) (Fig. 2E, dashed line). 
To ensure that outliers did not drive these results, we excluded 
genes that fell outside of two standard deviations from the mean 



AT>TA 




AT>CG 
GOAT 

CpG GC>AT 
AT>GC 



£ = 
o £ 

.a 

Figure 1. Comparison of total variant proportions of the seven variant 
subtypes across the three variant classes. The total variant proportion is 
shown for each of the seven variant subtypes, defined as the number of 
variants of a given subtype over the total number of variants in that variant 
class. The three variant classes were rare variants, common variants, and 
substitutions. 
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with mutation rate (GC content P-value = 
0.002, recombination rate P-value = 0.66). 

Using logistic regression to analyze 
per-site variant patterns 

The per-gene mutation rates analyzed 
above were calculated using all variant 
subtypes in aggregate; however, previous 
studies suggest that GC content and re- 
combination rate may have different ef- 
fects on specific variant subtypes (Lercher 
and Hurst 2002a; Arndt et al. 2005; Duret 
and Arndt 2008; Berglund et al. 2009). 
Estimating subtype-specific mutation 
rates on a per-gene or per-exon basis lacks 
a sufficient number of sites, especially for 
subtypes with relatively few observed 
variants (such as transversions). Therefore, 
we combined the —700 K targeted sites 
over all 195 genes, using a per-site logistic 
regression strategy to examine the effect of 
local GC content and recombination rate 
on the probability of observing a variant of 
a given subtype (see Methods). 

The dependent variable of the lo- 
gistic regression was obtained by scoring 
each site as either variant or invariant. If 
the site was scored as variant, it was fur- 
ther categorized into one of seven variant 
subtypes based on the ancestral and de- 
rived alleles. The log odds of a site being 
variant was regressed on GC content and 
recombination rate, calculated in 1-kb 
windows surrounding each individual 
site. 

GC content affected rare variants 
differently from common variants 
and substitutions 



Figure 2. Variability of mutation rates across 193 genes and relationships with genomic context. 
(A) Per-gene mutations rates (x 1 CP 7 per base pair per generation) for 1 93 genes, estimated previously 
by coalescent modeling (Nelson et al. 201 2), are shown ordered from lowest to highest. The black line 
indicates the average of 1 93 genes (1 .02 x 1 0~ s per base pair per generation). (8) Per-gene average GC 
content ordered as in A. (C) Per-gene average recombination rate (log qo cM/Mb) ordered as in A. 
(D) Relationship between GC content and mutation rate. The dashed line represents the linear re- 
gression fitting. After removing outliers (gray filled points), the regression was recalculated (dotted line). 
(£) Relationship between recombination rate (log qo cM/Mb) and mutation rate. The dashed line rep- 
resents the linear regression fitting. Outliers were removed (gray filled points) and the regression was 
recalculated (dotted lines). 



GC content or mutation rate (N = 8) and the recombination rate 
(N = 10). There was a slight increase in the correlation with GC 
content and little in the correlation with recombination rate 
(dotted line in Fig. 2D and 2E, respectively). As previously reported 
(Kong et al. 2002), GC content and recombination rate themselves 
are positively correlated (Pearson's r = 0.18, P = 0.017). Multiple 
linear regression including both GC content and recombination 
rate as covariates did not change the results from either regression 
alone, and recombination rate was still not significantly correlated 



Overall, the probability of observing any 
rare variant was positively influenced by 
GC content (|3 = 0.68, P-value < 10~ 16 ). 
However, individual subtypes showed 
mostly negative or relatively small positive 
effects of GC content (Fig. 3). The obser- 
vation that individual subtypes could 
show opposite regression results to all 
variants combined may seem counter- 
intuitive, but is an example of Simpson's 
Paradox, where trends observed in sub- 
sets of the data can be the opposite of 
what is observed in the entire data set (Agresti 2002). CpG-in- 
duced GC > AT variants, one of the major variant subtypes, tended 
to lie in GC-rich regions (50%-65% GC content), whereas AT > GC 
transitions tended to occur in GC-poor regions (30%-45% GC con- 
tent) (Supplemental Fig. 1). The unbalanced distribution of GC 
content across variant subtypes, combined with the much higher 
mutation rate at CpG dinucleotides, drove the observed positive 
slope for all variants combined (Supplemental Fig. 1). When all 
CpG sites (variant or invariant) were removed and the regression 
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Figure 3. Regression results for CC content across variant subtypes for 
rare variants, common variants, and substitutions. The relationship be- 
tween local GC content and the observed conditional variant proportion 
for seven variant subtypes: (A) AT > GC, (fi) AT > CG, (C) AT > TA, (D) CpG 
GC > AT, (£) GC > AT, (F) GC > TA, and (C) GC > CG. Filled points show the 
conditional variant proportions in each GC content bin, scaled by the 
intercept of the logistic regression^^^e", where a is the intercept calcu- 
lated in the regression, n x> y is the count of the given X> Y variant sub- 
type, and Nxs is the number of X ancestral invariant sites that could 
produce the given subtype in the i th GC content bin. Symbol size rep- 
resents the proportion of the given variant subtype falling into a given GC- 
content bin. The solid lines show the fitted logistic regression curve, where 
P is the slope fitted in the logistic regression and Xj is the GC content in the 
(' th bin. The gray dashed line represents the baseline of no effect, /3 = 0. 
Legends in each subplot show the regression slope calculated for each 
variant class and its significance. (***) P-value < 0.0001, (**) P-value < 
0.001, (*) P-value < 0.01. 



studying variant subtypes, as analysis of all variants in aggregate 
could miss the underlying pattern of individual subtypes. 

Comparison of rare variants and common variants or sub- 
stitutions revealed subtype-specific differences among the three 
variant classes (Fig. 3). For AT > GC and AT > CG rare variants, there 
was a relatively strong negative relationship between variant pro- 
portions and GC content (Fig. 3A,B). These same trends, however, 
were not observed in AT > GC and AT > CG common variants or 
substitutions, for which the trends were weaker, and sometimes 
positive (Fig. 3A,B). In contrast, for GC > AT and GC > TA variants, 
there were relatively strong negative effects on common variants 
and substitutions, yet the effects on rare variants were smaller or 
absent (Fig. 3E,F). Together, these results show that GC-rich re- 
gions tend to have fewer W > S rare variants and fewer S > W 
common variants or substitutions than GC-poor regions. There 
was a strong negative effect on CpG GC > AT, consistent across rare 
variants, common variants, and substitutions (Fig. 3D). We also 
observed consistent negative effects on AT > TA (Fig. 3C) and GC > 
CG variants (Fig. 3G) across variant classes. 

Recombination affects patterns of common variants 
and substitutions, but not rare variants 

The influence of recombination rate on total rare variants (B = 0. 15, 
P-value = 3.58 x 1(T 4 ) and individual variant subtypes was rela- 
tively small (Fig. 4). In comparison, the effect was much stronger 
on total common variants (B = 0.95, P-value < 1(T 16 ) and total 
substitutions (S = 0.34, P-value < 10~ 16 ), as well as on all variant 
subtypes (Fig. 4). There was a strong positive effect on W > S 
common variants and substitutions (Fig. 4A,B), consistent with the 
expected impact of BGC on variant patterns in the human ge- 
nome. For the other common variant subtypes (Fig. 4C-G), the 
effect was positive but weaker than W > S variants. In contrast, the 
effect on substitutions was negative (Fig. 4C-F) or slightly positive 
(Fig. 4G). While the positive trends seen in W > S common variant 
subtypes could be explained solely by BGC, the positive effects in 
other subtypes suggest that either selective sweep or background 
selection could also be acting on these variants. Importantly, the 
lack of effect on rare variants suggests that mutation rates are not 
altered by recombination rate. 

Since the deCODE recombination map published in 2002 has 
limited resolution, with 1257 meioses (Kong et al. 2002), we also 
adopted the higher-resolution deCODE map published in 2010, 
with 15,257 meioses and a higher marker density (Kong et al. 
2010), and reanalyzed the effect of recombination rates on variant 
subtypes (Supplemental Table 1). The results are largely consistent 
with the results using the 2002 rates. For rare variants, no subtype 
was strongly affected by the 2010 recombination rates, just like the 
results with the 2002 rates. For substitutions, five of the seven 
subtypes had the same effect direction in the two versions of re- 
combination rate used. For common variants, the regression co- 
efficients were positive for all subtypes, which is what we observed 
with the 2002 rates. Importantly, the AT > GC and AT > CG sub- 
types showed the strongest effects, consistent with the influence 
of BGC. 



run, the relationship between total rare variants and GC content 
became negative and was no longer significant ((3 = -0.17, P-value = 
0.028). A similar finding was noted previously for substitution data 
(Duret and Arndt 2008). These results highlight the importance of 



Recombination hotspots influence common variants, 
but have little effect on rare variants or substitutions 

Previous studies suggested that the distance to a recombination 
hotspot accounts for most of the observed correlation between 
nucleotide diversity and recombination rate (Spencer 2006; 
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Figure 4. Regression results for recombination rate across variant sub- 
type for rare variants, common variants, and substitutions. The relation- 
ship between local recombination rate (log qo cM/Mb) and the observed 
conditional variant proportion for seven variant subtypes: (A) AT > GC, 
(6) AT > CC, (C) AT > TA, (D) CpG GC > AT, (£) GC > AT, (f) GC > TA, and 
(C) GC > CG (plotted as in Fig. 3). Filled points show the conditional 
variant proportions, scaled by the intercept of the logistic regression. 
Symbol size represents the proportion of the given variant subtype falling 
into a given recombination rate bin. The solid lines show the fitted logistic 
regression curve, where /3 is the slope fitted in the logistic regression and*,- 
is the recombination rate in the / th bin. The gray dashed line represents 
the baseline of no effect, /3 = 0. 



Spencer et al. 2006). To examine the effect of recombination hot- 
spots, we calculated for each site its absolute distance to the nearest 
recombination hotspot (DTH) as reported in the population-based 
estimates from the HapMap Project (McVean et al. 2004; Myers et al. 
2005). Median per-site DTH was consistent across all variant classes 
(median and standard deviation of log-transformed absolute DTH 
for rare variants: 4.43 ± 0.65, common variants: 4.32 ± 0.60, and 
substitutions: 4.32 ± 0.60). The regression results using DTH, 
shown in Figure 5, were largely consistent with those for re- 



combination rate (Fig. 4). We observed relatively weak relationships 
between DTH and rare variants for total ((3 = -0.042, P-value = 
1.61 x 10~ 4 ) and all variant subtypes (Fig. 5). The strongest of 
these, GC > AT rare variants, had a negative relationship with DTH, 
but it was weaker than the relationship observed in common 
variants (Fig. 5E). DTH had a negative effect on total common 
variants ((3 = -0.15, P-value < 10~ 16 ) and for each of the seven 
variant subtypes (Fig. 5). For substitutions, however, the negative 
effects were either weaker than those seen for common variants 
(Fig. 5A,D,G) or positive (Fig. 5B,C,E,F). 
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Figure 5. Regression results for DTH across variant subtypes for rare 
variants, common variants, and substitutions. The relationship between 
DTH (log 10 bp)and the seven variant subtypes: (/l)AT>GC, (B) AT > CG, 
(C) AT > TA, (D) CpG GC > AT, (£) GC > AT, (f) GC > TA, and (C) GC > CG 
(plotted as in Fig. 3). Filled points show the conditional variant pro- 
portions, scaled by the intercept of the logistic regression. Symbol size 
represents the proportion of the given variant subtype falling into a given 
DTH bin. The solid lines are the fitted logistic regression curve, where f} is 
the slope fitted in the logistic regression and X; is the DTH in the i th bin. 
The gray dashed line represents the baseline of no effect. 
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An alternative approach to assess the effect of recombination 
hotspots is to compare the variants inside recombination hotspots 
with those outside. Of the 20,053 rare variants, 1636 are inside 
hotspots, and the remaining 18,417 are outside. The conditional 
variant proportion is similar between the inside and outside groups, 
both for all rare variants combined and for individual subtypes 
(Supplemental Table 2), directly suggesting that recombination 
hotspots are not inherently mutagenic. Correspondingly, regression 
analyses of rare variants using inside versus outside of a recom- 
bination hotspot as the independent variable showed very weak 
effects for all subtypes (Supplemental Table 3). For common vari- 
ants, AT > GC and AT > CG variants had two of the strongest positive 
regression results among the common variant subtypes (Supple- 
mental Table 3). This pattern, in addition to the results from the 
recombination rates, is consistent with the effect of BGC. For sub- 
stitutions, the strongest effect is in the AT > GC subtype, with 
a positive regression result, similar to the common variants. 

Validation and robustness of regression results 

To test how well our results for 195 genes compare with data sets 
with more complete exome coverage, we analyzed the variants in 
the European subset (N = 3510) of the Exome Sequencing Project 
(ESP), recently described by Tennessen et al. (2012). The majority 
of the regression results from the 195 genes agreed with the results 
from the whole-exome data (based on 99% confidence intervals) 
(Table 2). In addition, we used several strategies to test the robust- 
ness of our regression results. Since coding variants are more likely 
to be under selection, we separately analyzed the rare variants in 
coding exons (N = 8738) and those in flanking intronic regions (N= 
4642), and saw little difference in regression results (Supplemental 
Tables 4, 5). We also calculated GC content in several window sizes 
and saw little difference in the results (Supplemental Fig. 2). Since 
logistic regression is based on a specific probability model, we 
used permutations to confirm that the model-based P-values for 
rare variants are accurate (Supplemental Table 6), and used sub- 
sampling (Supplemental Fig. 3) and bootstrapping (Supplemental 
Fig. 4) techniques to verify that the magnitude of the rare variant 
regression results were unbiased. We also jointly tested the effect of 
genomic context using multivariate regression models and com- 
pared the results with those of univariate models for rare variants, 
common variants, and substitutions (Supplemental Tables 7-9). 
Additionally, we showed that sequencing coverage has little effect 
on the regression results for rare variants (Supplemental Table 10). 
Finally, to assess the impact of potential errors in the ancestral to 



derived allele orientation, we compared the ancestral sequence 
definition we adopted, based on a four-species sequence align- 
ment, with the naive method of using the chimpanzee reference 
allele as the ancestral allele (Supplemental Methods). Permutation 
analyses showed that even errors on the order of 3%-5%, such as 
those that would be imposed using this naive method, had little 
effect on the results (Supplemental Fig. 5). Detailed information 
regarding these analyses is included in the Supplemental Results 
and Methods. 

Discussion 

In this study, we used rare variants as a model to examine mutation 
patterns of different variant subtypes in the human genome. We 
also used common variants and human-chimpanzee substitutions 
to analyze the ongoing biases toward fixation, involving natural 
selection and neutral evolutionary processes. Our results suggest 
that both mutation rates and fixation biases are affected by local 
GC content. However, fixation processes, and not mutation per se, 
are affected by the recombination rate. 

Using rare variants to analyze spontaneous mutation rates 
was previously suggested in anticipation of the emergence of rare 
variant data from the next-generation sequencing studies (Messer 
2009). Rare variants arose more recently in the population. For 
example, the variants we analyzed, with DAF £ 10~ 4 , arose an 
average of —10 generations in the past, assuming a current pop- 
ulation size of 50,000 individuals and a population growth rate of 
0.001 (Slatkin 2000). In populations undergoing recent expansion 
(Coventry et al. 2010) such low-frequency variants will be even 
younger. As a result, rare variant patterns are primarily governed by 
mutation itself. Unless the force of selection is strong, natural se- 
lection, population demographic history, and BGC will not alter 
the observed patterns of rare variants. 

We considered analyzing synonymous and nonsynonymous 
variants separately to further minimize the effects of natural se- 
lection. However, our logistic regression approach works on 
individual variant and invariant sites. It is difficult to analyze 
synonymous and nonsynonymous variants separately because 
each ancestral allele could mutate to three other nucleotides, and 
one needs to enumerate the potential synonymous and non- 
synonymous variants that could occur at each site. As an alternative, 
we separately analyzed coding and noncoding rare variants and did 
not find any significant difference between these two functional 
classes, consistent with theoretical analysis showing that the effect 
of selection is attenuated among rare variants (Messer 2009). 



Table 2. Regression coefficients for rare variants in the 195 gene data set compared with the ESP whole-exome data set 

GC content Recombination rate DTH 



Variant subtypes 195 genes ESP 195 genes ESP 195 genes ESP 



Total 


0.68 


(0.069) 


0.64 


(0.012) 


0.1 s 


(0.043) 


0.34 


(0.0076) 


-0.042 


(0.011) 


-0.057 


(0.0020) 


AT > GC 


-1.048 


(0.15) 


-0.64 


(0.028) 


0.014 


(0.089) 


0.14 


(0.01 7) 


-0.025 


(0.023) 


-0.057 


(0.0044) 


AT > CG 


-0.56 


(0.29) 


-0.17 


(0.057) 


-0.014 


(0.18) 


0.13 


(0.034) 


-0.060 


(0.044) 


-0.051 


(0.0091) 


AT > TA 


-0.98 


(0.32) 


-0.21 


(0.062) 


-0.065 


(0.19) 


0.21 


(0.037) 


0.023 


(0.049) 


-0.034 


(0.0099) 


CpG GC > AT 


-2.64 


(0.17) 


-3.072 


(0.024) 


-0.13 


(0.10) 


0.17 


(0.014) 


-0.047 


(0.025) 


-0.077 


(0.0039) 


GC > AT 


0.024 


(0.14) 


-0.26 


(0.025) 


0.19 


(0.081) 


0.20 


(0.015) 


-0.089 


(0.021) 


-0.058 


(0.0041) 


GC > TA 


-0.80 


(0.25) 


-0.91 


(0.048) 


0.024 


(0.15) 


0.15 


(0.029) 


-0.054 


(0.039) 


-0.061 


(0.0077) 


GC>CG 


-0.53 


(0.24) 


-0.96 


(0.043) 


0.054 


(0.14) 


0.13 


(0.026) 


0.025 


(0.037) 


-0.055 


(0.0069) 



P coefficients and standard error (in parentheses) for all variant subtypes from the original rare variant analysis in 1 95 genes compared with those from the 
ESP whole-exome sequencing data analysis. Values shown in bold indicate coefficients that are significantly different between the two data sets, based on 
99% confidence intervals (not shown). 
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The average per-gene mutation rate, based on exome se- 
quence data from 193 genes, was 1.02 x 1(T 8 per base pair per 
generation (Nelson et al. 2012), which is similar to recent estimates 
from family-based sequencing studies (The 1000 Genomes Project 
Consortium 2010; Campbell et al. 2012; Kong et al. 2012). 

Previous studies examining the effect of genomic context on 
mutation rate relied on local context measures computed in fixed- 
length genomic windows. This window-based approach is difficult 
to implement in exome sequencing data, because such data cover 
short intervals with variable length, representing targeted exons, 
separated by large gaps, representing introns. This leads to prob- 
lems in defining window width and estimating average parameter 
values. In our study, most target regions are small (85% < 500 bp), 
and calculating rates for low frequency events, such as transver- 
sions, in these windows would be highly inaccurate. We therefore 
adopted a logistic regression approach, using data for individual 
base positions and aggregating data across sites. This approach has 
several advantages. It eliminates the need to account for gaps in 
coverage from intronic and intergenic regions, and provides suf- 
ficient numbers of sites to study the effect of genomic context on 
individual variant subtypes. 

Multiple results suggest that recombination rate has a rela- 
tively small effect on mutation patterns, but a significant impact 
on common variants in the population. First, we did not observe 
a correlation between per-gene mutation rate and recombination 
rate. Second, the effect of recombination rate on rare variant sub- 
types was small, especially when compared with the effect on 
common variants and substitutions. AT > GC and AT > CG com- 
mon variants and substitutions were both strongly affected by 
recombination rate, consistent with the role of BGC altering pat- 
terns of standing variation in the human genome. BGC has no 
effect on mutation rates, but over time, is expected to lead to 
a fixation bias toward GC bases at AT/GC polymorphic sites (Duret 
and Galtier 2009). A recent study reported a strong bias of W > S 
substitutions in human accelerated regions and this bias increased 
with increasing male recombination rate (Berglund et al. 2009). 
Furthermore, BGC can drive deleterious GC alleles to fixation 
(Galtier et al. 2009) and lead to the apparent increase in sub- 
stitution rate with increasing recombination rate (Meunier and 
Duret 2004; Duret and Arndt 2008; Berglund et al. 2009; Galtier 
et al. 2009). These conclusions based on local recombination rates 
were supported by the analysis of recombination hotspots. While 
our results cannot completely rule out a mutagenic effect due to 
recombination, they suggest that if such an effect does exist, it is 
relatively small in comparison to the influence of BGC. 

Background selection and selective sweeps can also drive 
positive correlations between diversity and recombination rate 
(Smith and Haigh 1974; Kaplan et al. 1989; Charlesworth et al. 
1993; Hudson and Kaplan 1995; Cai et al. 2009; Lohmueller et al. 
2011). These selection-dependent mechanisms are unlikely to af- 
fect rare variants because they are too young in the population. In 
addition to the impact of recombination rate on AT > GC and AT > 
CG common variants and substitutions, we also saw relatively 
strong effects on other variant subtypes. Therefore, we cannot rule 
out the effect of these other recombination-associated processes. 

GC content varies throughout the genome, with long 
stretches of DNA exhibiting relatively stable GC content, known as 
isochores (Eyre-Walker and Hurst 2001). Previous studies propose 
that mutation bias or fixation bias drives the apparent regional 
variation in GC content and maintenance of isochores (Smith et al. 
2002; Webster et al. 2003; Duret and Arndt 2008). Our results are 
consistent with this hypothesis, suggesting that GC-rich regions of 



the genome may maintain base composition by simultaneously 
decreasing GC-enriching, W > S, mutations and reducing the fix- 
ation of GC-depleting, S > W, common variants. 

Understanding the relationship between local genomic con- 
text and mutation processes has several practical implications. 
More precise estimates of de novo mutation rates can improve 
genotype calling from short sequencing reads by providing better 
prior distributions for mutation spectrum. Moreover, our results 
can help to identify potentially functional de novo mutations by 
highlighting new variants that are unlikely to arise spontaneously. 

Our study, however, has several limitations. We are not able to 
identify all potential mutations, as some will not be viable in 
humans. However, truly dominant lethal mutations are extremely 
rare and other approaches, including direct discovery of de novo 
variants via trio sequencing, will have similar limitations. Addi- 
tionally, while rare variants are very young on the evolutionary 
time scale, they could still be influenced by the same confounding 
factors that affect common variants and substitutions, albeit to 
a lesser degree. At present, however, rare variants, especially the 
extremely rare variants we study here, represent one of the most 
powerful data sets currently available for high-sensitivity analysis 
of the rate and molecular spectrum of new mutations. Finally, our 
data set involves only 195 genes and could generate a biased rep- 
resentation of the genome. Indeed, these genes appear to be under 
stronger purifying selection than other genes (Nelson et al. 2012). 
Despite this caveat, we observed strong concordance between the 
results from the 195 genes and those from an exome-wide data set, 
indicating that any selection acting on these genes does not in- 
fluence the relationship with genomic context and that our results 
are representative of the exome. 

In conclusion, our data set of >20,000 rare variants (DAF < 
10~ 4 ) represent a valuable resource for studying patterns of single- 
nucleotide mutation in humans. It allows us to take a new step 
toward differentiating the initial mutation processes from the 
subsequent forces that act more gradually, affecting fixation pro- 
cesses of segregating variants. Our results reveal a substantial dif- 
ference in the relative abundance and conditional proportion of 
variant subtypes between rare variants, common variants, and 
substitutions. GC content has a strong impact on all variant clas- 
ses, although the effect is different both among variant classes and 
among different subtypes. Recombination rate, on the other hand, 
has relatively little effect on rare variants, but a much stronger 
effect on AT > GC and AT > CG common variants and sub- 
stitutions, consistent with BGC acting on existing variants. Future 
research, aided by deep sequencing data over more genomic targets 
in larger population samples, will be needed to acquire more pre- 
cise estimates of such fundamental parameters. Eventually, these 
studies will help unravel the relative contribution of diverse evo- 
lutionary forces acting over different time scales. Such an un- 
derstanding will also provide the knowledge necessary to study the 
allelic spectrum of inherited and somatic diseases, as well as the 
dynamics of human genome variation as it evolves under a variety 
of environmental and demographic conditions. 

Methods 

Ethics statement 

All study participants in the component studies provided written 
informed consent for the use of their DNA in genetic studies. A 
careful review was conducted to verify that the consents were 
consistent with the activities of this study. In instances where the 
appropriateness of the informed consent for the current study was 
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not clear, further Institutional Review Board approval was sought 
and obtained. 

Data source and processing 
Rare variants 

We utilized single-nucleotide variants previously described in 
Nelson et al. (2012), which can be accessed at http://www. 
ncbi.nlm.nih.gov/projects/SNP/snp_viewBatch.cgi?sbid=1056695. 
The variants were discovered from a targeted resequencing study 
of the exons of 202 potential drug target genes (including 50 bp 
flanking each exon). For this study, we analyzed 195 autosomal 
genes, and focused on variants identified in individuals of Euro- 
pean descent (N= 12,515). We defined rare variants as those with 
a DAF £ 10~ 4 . We oriented all variants along the human ancestral 
sequence, as defined by members of The 1000 Genomes Project 
Consortium (http://ftp.1000genomes.ebi.ac.uk/voll/ftp/phasel/ 
analysis_results/supporting/ancestral_alignments/; date accessed: 
January 3, 2012). The ancestral allele definition relied on the four- 
way sequence alignment among human, chimpanzee, orangutan, 
and macaque genomes, and it estimated the ancestral state using 
a probabilistic phylogenetic model. We analyzed the impact of 
potential errors in ancestral allele definition on our results and 
found that even error rates on the order of 3%-5%, such as when 
naively using the chimpanzee reference allele as the ancestral 
allele, did not change our results (Supplemental Methods and 
Results). To minimize the potential confounding effects due to 
coverage, and to enrich for high-quality variants, we selected 
variant and invariant sites with >10x coverage using per-site 
coverage data from a random sample of 500 individuals reported 
by Nelson et al. (2012). 

We subdivided variants into seven distinct subtypes based on 
the ancestral and derived alleles: AT > GC, GC > AT (non-CpG), 
CpG GC > AT, AT > CG, GC > TA, AT > TA, and GC > CG. GC > AT 
transitions that occurred at an ancestral CpG site (CpG GC > AT) 
were analyzed separately from other GC > AT variants because 
hypermethylation of the cytosine base at CpG dinucleotides leads 
to spontaneous deamination, resulting in C > T and G > A transi- 
tions that occur with substantially higher rates than other sub- 
types (Nachman and Crowell 2000; Kondrashov 2003; Hwang and 
Green 2004). In addition, previous studies found that substitution 
rates at CpG dinucleotides are more strongly negatively correlated 
with GC content and recombination rate compared with non- 
CpG-induced GC > AT transitions, suggesting that different mo- 
lecular mechanisms may be involved (Arndt et al. 2005; Duret and 
Arndt 2008). 

GC > TA and GC > CG variants at CpG sites, which make up 
the eighth and ninth variant subtypes, were analyzed separately 
from non-CpG-induced GC > TA and GC > CG variants. They were 
modeled in the multinomial logistic regressions with CpG as the 
ancestral base (see below). As there are relatively few observed 
variants of these two subtypes (—200 each in our data set), it is 
difficult to accurately analyze mutation patterns and we did not 
report these results. These variants, however, are included in the 
GC > TA and GC > CG variant subtypes presented in Table 1, and 
included when analyzing all variant subtypes combined. 

Per-gene mutation rates and genomic context 

We analyzed mutation rates calculated for 193 of the 195 autoso- 
mal genes (two genes were excluded due to low numbers of vari- 
ants), as described previously (Nelson et al. 2012). For each of the 
193 genes, we calculated the average GC content and sex-averaged 
pedigree-based recombination rates (Kong et al. 2002) within the 
transcribed region of each gene based on definitions in RefGene. 



Linear regression was performed in R (R Development Core Team 
2008). 

Sampling of intergenic regions to obtain common variants 
and substitutions 

To sample common variants and substitutions from random ge- 
nomic intervals with the least selective pressure, we first defined 
intergenic regions by masking all genie regions ±1 kb of the 
transcription start and end site of any gene based on RefGene in 
hgl8. We then removed all regions that were not uniquely aligned 
in the four-way alignments between human, chimpanzee, orang- 
utan, and macaque (ftp://ftp.ensembl.org/pub/release-54/emf/ 
ensembl-compara/epo_4_catarrhini/; date accessed: December 7, 
2011). To match the distribution of genomic features with the 
rare variant data as closely as possible, we sampled 32,279 auto- 
somal regions from all possible regions according to their geno- 
mic parameters. Specifically, we matched the size distribution as 
well as the joint distribution of GC content and recombination 
rate (Kong et al. 2002) of the selected regions to those of the 
target regions in the exome sequencing of the 202 genes. We used 
these regions to sample common variants. Because there were 
substantially more substitutions in these regions than common 
variants, we randomly subsampled 12,034 of the 32,279 regions 
to obtain substitutions. The median and standard deviation of 
GC content across the assayed regions was 0.49 ± 0.12, 0.47 ± 
0.11, and 0.47 ± 0.12 for rare variants, common variants, 
and substitutions, respectively. The median and standard de- 
viation for recombination rate (log-transformed, in units of 
cM/Mb) was 0.29 ±0.17 for rare variants, common variants, and 
substitutions. 

Common variant data 

Single-nucleotide variants from the interim phase 1 haplotype 
data from The 1000 Genomes Project Consortium were used 
to assemble a data set of common variants. The frequency file 
for the European subset (N = 381) of the data was downloaded 
from http://www.sph.umich.edu/csg/abecasis/MACH/download/ 
1000G-PhaseI-Interim.html; date accessed: December 20, 201 1. All 
variants within the selected regions, as described above, were ori- 
ented ancestral to derived. Successfully oriented variants with 
a DAF > 0.05 were categorized into the seven variant subtypes and 
they form the common variant data set. 

Substitution data 

Substitutions between human and chimpanzee were obtained 
using the four-way alignments between human, chimpanzee, 
orangutan, and rhesus macaque (ftp://ftp.ensembl.org/pub/ 
release-54/emf/ensembl-compara/epo_4_catarrhini/; date accessed: 
December 7, 2011). To identify substitutions, only regions where 
there was a unique human, chimp, and orangutan alignment were 
used. Single-base human-chimpanzee differences were sampled 
from the 12,034 intergenic regions as described above. All sites were 
oriented along the ancestral lineage and categorized into the seven 
variant subtypes. Variant sites where the human lineage base rep- 
resents the ancestral allele were excluded. 

ESP rare variants 

Variants from the NHLBI Exome Sequencing Project (ESP) from 
5400 individuals were downloaded from the Exome Variant Server 
(Exome Variant Server, NHLBI ESP, Seattle, WA, URL: http://evs. 
gs.washington.edu/EVS/; date accessed: December 2, 2011]). We 
also utilized sequence coverage data downloaded from the Exome 
Variant Server (date accessed: December 2, 2011 and December 5, 
2011) to select sites with >10x coverage. Subsequent analysis 



1982 Genome Research 

www.genome.org 



Mutation patterns revealed by rare variants 



focused on singleton variants (DAF = 1.4 x 1CT 4 ) identified in 
Europeans (N = 3510). Variants were oriented along the ancestral 
allele, as before. 

Logistic regression analysis 

We used a logistic regression framework to model the effect of GC 
content, recombination rate, and distance to recombination hot- 
spot on the occurrence of rare variants, common variants, and 
substitutions. We defined GC content at a given site as the per- 
centage of GC bases in a 1-kb window surrounding the site (500 bp 
upstream, 500 bp downstream) based on the human genome ref- 
erence sequence (hgl8). We calculated the average recombination 
rate in a 1-kb window surrounding each site using both the 2002 
deCODE sex-averaged recombination rates (Kong et al. 2002) and 
the 2010 sex-averaged recombination maps (Kong et al. 2010). The 
absolute distance to the center of the nearest recombination hot- 
spot was calculated for each site using recombination hotspot co- 
ordinates from Phase II of the HapMap Project (McVean et al. 2004; 
Myers et al. 2005). These same definitions of recombination hot- 
spots were used to define sites that fell inside and outside of 
recombination hotspots. We excluded sites if they were within 
repeats as defined by RepeatMasker. Recombination rates and 
distances to hotspots were log-transformed to more closely re- 
semble a normal distribution. 

To examine the impact on total mutation (all subtypes 
combined), we regressed the logit of the probability of a site 
containing a rare variant of any subtype against GC content, re- 
combination rate, or DTH using separate logistic regression models 
for each genomic context variable. Each logistic regression has 
the form 

ln (w) = " + /3Z ' 

where p is the probability that the site contains a rare variant and 
z is either GC content, recombination rate, or DTH at that site. We 
assessed the significance of the regression using a Wald test on the 
/3 parameter. We fit similar regression models for common variants 
and substitutions. 

Next, to analyze the effect of genomic context on specific 
variant subtypes, we employed a multinomial logistic regression 
model that jointly analyzes the probability of all possible variant 
subtypes for a given ancestral state. Additional details are in the 
Supplemental Methods. Logistic and multinomial regression was 
performed in R (R Development Core Team 2008), using the mlogit 
package for multinomial regression. 
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